{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "# Creating a QA bot from technical documentation\n",
        "\n",
        "This notebook demonstrates how to create a chatbot (single turn) that answers user questions based on technical documentation made available to the model.\n",
        "\n",
        "We use the `aws-documentation` dataset ([link](https://github.com/siagholami/aws-documentation/tree/main)) for representativeness. This dataset contains 26k+ AWS documentation pages, preprocessed into 120k+ chunks, and 100 questions based on real user questions.\n",
        "\n",
        "We proceed as follows:\n",
        "1. Embed the AWS documentation into a vector database using Cohere embeddings and `llama_index`\n",
        "2. Build a retriever using Cohere's `rerank` for better accuracy, lower inference costs and lower latency\n",
        "3. Create model answers for the eval set of 100 questions\n",
        "4. Evaluate the model answers against the golden answers of the eval set\n"
      ],
      "metadata": {
        "id": "d5-gWUEqG4-T"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Setup"
      ],
      "metadata": {
        "id": "JY2wHt3AG7X0"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "wajdtJmNG0Uy"
      },
      "outputs": [],
      "source": [
        "%%capture\n",
        "!pip install cohere datasets llama_index llama-index-llms-cohere llama-index-embeddings-cohere"
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "import cohere\n",
        "import datasets\n",
        "from llama_index.core import StorageContext, VectorStoreIndex, load_index_from_storage\n",
        "from llama_index.core.schema import TextNode\n",
        "from llama_index.embeddings.cohere import CohereEmbedding\n",
        "import pandas as pd\n",
        "\n",
        "import json\n",
        "from pathlib import Path\n",
        "from tqdm import tqdm\n",
        "from typing import List\n"
      ],
      "metadata": {
        "id": "kJ_9dyJxG0zA"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Set up Cohere client\n",
        "api_key = \"\" # <your API key>\n",
        "co = cohere.Client(api_key=api_key)"
      ],
      "metadata": {
        "id": "EECuI7OdIy8M"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## 1. Embed technical documentation and store as vector database\n",
        "\n",
        "* Load the dataset from HuggingFace\n",
        "* Compute embeddings using Cohere's implementation in LlamaIndex, `CohereEmbedding`\n",
        "* Store inside a vector database, `VectorStoreIndex` from LlamaIndex\n",
        "\n",
        "\n",
        "Because this process is lengthy (~2h for all documents on a MacBookPro), we store the index to disc for future reuse. We also provide a (commented) code snippet to index only a subset of the data. If you use this snippet, bear in mind that many documents will become unavailable to the model and, as a result, performance will suffer!"
      ],
      "metadata": {
        "id": "u_UfRVoBIHmD"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "data = datasets.load_dataset(\"sauravjoshi23/aws-documentation-chunked\")\n",
        "print(data)\n",
        "# The data comes prechunked. We keep the data as-is in this notebook.\n",
        "# For more information on optimal preprocessing strategies, please check\n",
        "# our other notebooks!\n",
        "\n",
        "# Build a mapping from sample id to index inside data (will be useful for retrieval later)\n",
        "map_id2index = {sample[\"id\"]: index for index, sample in enumerate(data[\"train\"])}\n"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "NoXezqvoG02f",
        "outputId": "cabfccf3-3c15-4955-dab9-e2734fa65a7d"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:88: UserWarning: \n",
            "The secret `HF_TOKEN` does not exist in your Colab secrets.\n",
            "To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.\n",
            "You will be able to reuse this secret in all of your notebooks.\n",
            "Please note that authentication is recommended but still optional to access public models or datasets.\n",
            "  warnings.warn(\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "DatasetDict({\n",
            "    train: Dataset({\n",
            "        features: ['id', 'text', 'source'],\n",
            "        num_rows: 187147\n",
            "    })\n",
            "})\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "# Create index in vector database, and persist it for later reuse\n",
        "# Note: this cell takes about ~2h on a MacBookPro\n",
        "\n",
        "overwrite = True # only compute index if it doesn't exist\n",
        "path_index = Path(\".\") / \"aws-documentation_index_cohere\"\n",
        "\n",
        "# Select Cohere's new `embed-english-v3.0` as the engine to compute embeddings\n",
        "embed_model = CohereEmbedding(\n",
        "    cohere_api_key=api_key,\n",
        "    model_name=\"embed-english-v3.0\",\n",
        ")\n",
        "\n",
        "if not path_index.exists() or overwrite:\n",
        "    # Documents are prechunked. Keep them as-is for now\n",
        "    stub_len = len(\"https://github.com/siagholami/aws-documentation/tree/main/documents/\")\n",
        "    documents = [\n",
        "        # -- for indexing full dataset --\n",
        "        TextNode(\n",
        "            text=sample[\"text\"],\n",
        "            title=sample[\"source\"][stub_len:], # save source minus stub\n",
        "            id_=sample[\"id\"],\n",
        "        ) for sample in data[\"train\"]\n",
        "        # -- for testing on subset --\n",
        "        # TextNode(\n",
        "        #     text=data[\"train\"][index][\"text\"],\n",
        "        #     title=data[\"train\"][index][\"source\"][stub_len:],\n",
        "        #     id_=data[\"train\"][index][\"id\"],\n",
        "        # ) for index in range(1_000)\n",
        "    ]\n",
        "    index = VectorStoreIndex(documents, embed_model=embed_model)\n",
        "    index.storage_context.persist(path_index)\n",
        "\n",
        "else:\n",
        "    storage_context = StorageContext.from_defaults(persist_dir=path_index)\n",
        "    index = load_index_from_storage(storage_context, embed_model=embed_model)\n"
      ],
      "metadata": {
        "id": "tvgtKDBTG05h"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## 2. Build a retriever using Cohere's `rerank`\n",
        "\n",
        "The vector database we built using `VectorStoreIndex` comes with an in-built retriever. We can call that retriever to fetch the top $k$ documents most relevant to the user question with:\n",
        "\n",
        "```python\n",
        "retriever = index.as_retriever(similarity_top_k=top_k)\n",
        "```\n",
        "\n",
        "We recently released [Rerank-3](https://txt.cohere.com/rerank-3/) (April '24), which we can use to improve the quality of retrieval, as well as reduce latency and the cost of inference. To use the retriever with `rerank`, we create a thin wrapper around `index.as_retriever` as follows:"
      ],
      "metadata": {
        "id": "sS2yzRwOKHTv"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "class RetrieverWithRerank:\n",
        "    def __init__(self, retriever, api_key):\n",
        "        self.retriever = retriever\n",
        "        self.co = cohere.Client(api_key=api_key)\n",
        "\n",
        "    def retrieve(self, query: str, top_n: int):\n",
        "        # First call to the retriever fetches the closest indices\n",
        "        nodes = self.retriever.retrieve(query)\n",
        "        nodes = [\n",
        "            {\n",
        "                \"text\": node.node.text,\n",
        "                \"llamaindex_id\": node.node.id_,\n",
        "            }\n",
        "            for node\n",
        "            in nodes\n",
        "        ]\n",
        "        # Call co.rerank to improve the relevance of retrieved documents\n",
        "        reranked = self.co.rerank(query=query, documents=nodes, model=\"rerank-english-v3.0\", top_n=top_n)\n",
        "        nodes = [nodes[node.index] for node in reranked.results]\n",
        "        return nodes\n",
        "\n",
        "\n",
        "top_k = 60 # how many documents to fetch on first pass\n",
        "top_n = 20 # how many documents to sub-select with rerank\n",
        "\n",
        "# Instantiate retriver\n",
        "retriever = RetrieverWithRerank(\n",
        "    index.as_retriever(similarity_top_k=top_k),\n",
        "    api_key=api_key,\n",
        ")\n"
      ],
      "metadata": {
        "id": "Wy_BGGbGG08w"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Test the retriever on a single question!\n",
        "query = \"What happens to my Amazon EC2 instances if I delete my Auto Scaling group?\"\n",
        "\n",
        "# Retrieving relevant documents with rerank now fits in one line\n",
        "documents = retriever.retrieve(query, top_n=top_n)\n",
        "\n",
        "# Call Cohere's RAG pipeline with co.chat and the `documents` argument\n",
        "resp = co.chat(message=query, model=\"command-r\", temperature=0., documents=documents)\n",
        "print(resp.text)\n"
      ],
      "metadata": {
        "id": "ODihI1YCG0_5"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "This works! With `co.chat`, you get the additional benefit that citations are returned for every span of text. Here's a simple function to display the citations inside square brackets."
      ],
      "metadata": {
        "id": "tHTKtvMTMLhz"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "def build_answer_with_citations(response):\n",
        "    \"\"\" \"\"\"\n",
        "    text = response.text\n",
        "    citations = response.citations\n",
        "\n",
        "    # Construct text_with_citations adding citation spans as we iterate through citations\n",
        "    end = 0\n",
        "    text_with_citations = \"\"\n",
        "\n",
        "    for citation in citations:\n",
        "        # Add snippet between last citatiton and current citation\n",
        "        start = citation.start\n",
        "        text_with_citations += text[end : start]\n",
        "        end = citation.end  # overwrite\n",
        "        citation_blocks = \" [\" + \", \".join([stub[4:] for stub in citation.document_ids]) + \"] \"\n",
        "        text_with_citations += text[start : end] + citation_blocks\n",
        "    # Add any left-over\n",
        "    text_with_citations += text[end:]\n",
        "\n",
        "    return text_with_citations\n",
        "\n",
        "grounded_answer = build_answer_with_citations(resp)\n",
        "print(grounded_answer)\n"
      ],
      "metadata": {
        "id": "7fH5mv2PG1DI"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## 3. Create model answers for 100 QA pairs\n",
        "\n",
        "Now that we have a running pipeline, we need to assess its performance.\n",
        "\n",
        "The author of the repository provides 100 QA pairs that we can test the model on. Let's download these questions, then run inference on all 100 questions. Later, we will use Command-R+ -- Cohere's largest and most powerful model -- to measure performance."
      ],
      "metadata": {
        "id": "f1mJdBQYRI4k"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Load data from github\n",
        "url = \"https://github.com/siagholami/aws-documentation/blob/main/QA_true.csv?raw=true\"\n",
        "qa_pairs = pd.read_csv(url)\n",
        "qa_pairs.sample(2)\n"
      ],
      "metadata": {
        "id": "jWkBVKSvMlip"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "We'll use the fields as follows:\n",
        "* `Question`: the user question, passed to `co.chat` to generate the answer\n",
        "* `Answer_True`: treat as the ground gruth; compare to the model-generated answer to determine its correctness\n",
        "* `Document_True`: treat as the (single) golden document; check the rank of this document inside the model's retrieved documents"
      ],
      "metadata": {
        "id": "eS_IDKnvcI40"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "We'll loop over each question and generate our model answer. We'll also complete two steps that will be useful for evaluating our model next:\n",
        "1. We compute the rank of the golden document amid the retrieved documents -- this will inform how well our retrieval system performs\n",
        "2. We prepare the grading prompts -- these will be sent to an LLM scorer to compute the goodness of responses"
      ],
      "metadata": {
        "id": "bi4gH235eMzA"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Define the LLM eval prompt\n",
        "# We request a score and a reason for assigning that score to 'trigger CoT' and\n",
        "# improve the model response\n",
        "\n",
        "LLM_EVAL_TEMPLATE = \"\"\"## References\n",
        "{references}\n",
        "\n",
        "QUESTION: based on the above reference documents, answer the following question: {question}\n",
        "ANSWER: {answer}\n",
        "STUDENT RESPONSE: {completion}\n",
        "\n",
        "Based on the question and answer above, grade the studen't reponse. A correct response will contain exactly \\\n",
        "the same information as in the answer, even if it is worded differently. If the student's reponse is correct, \\\n",
        "give it a score of 1. Otherwise, give it a score of 0. Let's think step by step. Return your answer as \\\n",
        "as a compilable JSON with the following structure:\n",
        "{{\n",
        "    \"reasoning\": <reasoning>,\n",
        "    \"score: <score of 0 or 1>,\n",
        "}}\"\"\"\n",
        "\n",
        "\n",
        "def get_rank_of_golden_within_retrieved(golden: str, retrieved: List[dict]) -> int:\n",
        "    \"\"\"\n",
        "    Returns the rank that the golden document (single) has within the retrieved documents\n",
        "    * `golden` contains the source of the document, e.g. 'amazon-ec2-user-guide/EBSEncryption.md'\n",
        "    * `retrieved` has a list of responses with key 'llamaindex_id', which links back to document sources\n",
        "    \"\"\"\n",
        "    # Create {document: rank} map using llamaindex_id (count first occurrence of any document; they can\n",
        "    # appear multiple times because they're chunked)\n",
        "    doc_to_rank = {}\n",
        "    for rank, doc in enumerate(retrieved):\n",
        "        # retrieve source of document\n",
        "        _id = doc[\"llamaindex_id\"]\n",
        "        source = data[\"train\"][map_id2index[_id]][\"source\"]\n",
        "        # format as in dataset\n",
        "        source = source[stub_len:]  # remove stub\n",
        "        source = source.replace(\"/doc_source\", \"\")  # remove /doc_source/\n",
        "        if source not in doc_to_rank:\n",
        "            doc_to_rank[source] = rank + 1\n",
        "\n",
        "    # Return rank of `golden`, defaulting to len(retrieved) + 1 if it's absent\n",
        "    return doc_to_rank.get(golden, len(retrieved) + 1)\n"
      ],
      "metadata": {
        "id": "rR67DcP5epAV"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "from tqdm import tqdm\n",
        "\n",
        "answers = []\n",
        "golden_answers = []\n",
        "ranks = []\n",
        "grading_prompts = []  # best computed in batch\n",
        "\n",
        "for _, row in tqdm(qa_pairs.iterrows(), total=len(qa_pairs)):\n",
        "    query, golden_answer, golden_doc = row[\"Question\"], row[\"Answer_True\"], row[\"Document_True\"]\n",
        "    golden_answers.append(golden_answer)\n",
        "\n",
        "    # --- Produce answer using retriever ---\n",
        "    documents = retriever.retrieve(query, top_n=top_n)\n",
        "    resp = co.chat(message=query, model=\"command-r\", temperature=0., documents=documents)\n",
        "    answer = resp.text\n",
        "    answers.append(answer)\n",
        "\n",
        "    # --- Do some prework for evaluation later ---\n",
        "    # Rank\n",
        "    rank = get_rank_of_golden_within_retrieved(golden_doc, documents)\n",
        "    ranks.append(rank)\n",
        "    # Score: construct the grading prompts for LLM evals, then evaluate in batch\n",
        "    # Need to reformat documents slightly\n",
        "    documents = [{\"index\": str(i), \"text\": doc[\"text\"]} for i, doc in enumerate(documents)]\n",
        "    references_text = \"\\n\\n\".join(\"\\n\".join([f\"{k}: {v}\" for k, v in doc.items()]) for doc in documents)\n",
        "    # ^ snippet looks complicated, but all it does it unpack all kwargs from `documents`\n",
        "    # into text separated by \\n\\n\n",
        "    grading_prompt = LLM_EVAL_TEMPLATE.format(\n",
        "        references=references_text, question=query, answer=golden_answer, completion=answer,\n",
        "    )\n",
        "    grading_prompts.append(grading_prompt)\n"
      ],
      "metadata": {
        "id": "tAg3MTOcMll4"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## 4. Evaluate model performance\n",
        "\n",
        "We want to test our model performance on two dimensions:\n",
        "1. How good is the final answer? We'll compare our model answer to the golden answer using Command-R+ as a judge.\n",
        "2. How good is the retrieval? We'll use the rank of the golden document within the retrieved documents to this end.\n",
        "\n",
        "Note that this pipeline is for illustration only. To measure performance in practice, we would want to run more in-depths tests on a broader, representative dataset."
      ],
      "metadata": {
        "id": "2BCs7ozyO1vL"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# For simplicity, prepare a DataFrame with the results\n",
        "results = pd.DataFrame()\n",
        "results[\"answer\"] = answers\n",
        "results[\"golden_answer\"] = qa_pairs[\"Answer_True\"]\n",
        "results[\"rank\"] = ranks\n"
      ],
      "metadata": {
        "id": "1qBvMpYmQDTK"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "### 4.1 Compare answer to golden answer\n",
        "\n",
        "We'll use Command-R+ as a judge of whether the answers produced by our model convey the same information as the golden answers. Since we've defined the grading prompts earlier, we can simply ask our LLM judge to evaluate that grading prompt. After a little bit of postprocessing, we can then extract our model scores."
      ],
      "metadata": {
        "id": "8Wglx0dsQtgX"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "scores = []\n",
        "reasonings = []\n",
        "\n",
        "def remove_backticks(text: str) -> str:\n",
        "  \"\"\"\n",
        "  Some models are trained to output JSON in Markdown formatting:\n",
        "  ```json {json object}```\n",
        "  Remove the backticks from those model responses so that they become\n",
        "  parasable by json.loads.\n",
        "  \"\"\"\n",
        "  if text.startswith(\"```json\"):\n",
        "      text = text[7:]\n",
        "  if text.endswith(\"```\"):\n",
        "      text = text[:-3]\n",
        "  return text\n",
        "\n",
        "\n",
        "for prompt in tqdm(grading_prompts, total=len(grading_prompts)):\n",
        "  resp = co.chat(message=prompt, model=\"command-r-plus\", temperature=0.)\n",
        "  # Convert response to JSON to extract the `score` and `reasoning` fields\n",
        "  # We remove backticks for compatibility with different LLMs\n",
        "  parsed = json.loads(remove_backticks(resp.text))\n",
        "  scores.append(parsed[\"score\"])\n",
        "  reasonings.append(parsed[\"reasoning\"])\n"
      ],
      "metadata": {
        "id": "_F3V4E56Q1CE"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Add scores to our DataFrame\n",
        "results[\"score\"] = scores\n",
        "results[\"reasoning\"] = reasonings"
      ],
      "metadata": {
        "id": "1xksywOeQ1IJ"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "print(f\"Average score: {results['score'].mean():.3f}\")\n"
      ],
      "metadata": {
        "id": "l8Z2w1ERSeHZ"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "### 4.2 Compute rank\n",
        "\n",
        "We've already computed the rank of the golden documents using `get_rank_of_golden_within_retrieved`. Here, we'll plot the histogram of ranks, using blue when the answer scored a 1, and red when the answer scored a 0."
      ],
      "metadata": {
        "id": "ob-Km6dAPeHJ"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import matplotlib.pyplot as plt\n",
        "import seaborn as sns\n",
        "\n",
        "sns.set_theme(style=\"darkgrid\", rc={\"grid.color\": \".8\"})\n",
        "\n",
        "# create offsets for better vis\n",
        "results[\"rank_shifted_left\"] = results[\"rank\"] - 0.1\n",
        "results[\"rank_shifted_right\"] = results[\"rank\"] + 0.1\n",
        "\n",
        "f, ax = plt.subplots(figsize=(5, 3))\n",
        "sns.histplot(data=results.loc[results[\"score\"] == 1], x=\"rank_shifted_left\", color=\"skyblue\", label=\"Correct answer\", binwidth=1)\n",
        "sns.histplot(data=results.loc[results[\"score\"] == 0], x=\"rank_shifted_right\", color=\"red\", label=\"False answer\", binwidth=1)\n",
        "\n",
        "ax.set_xticks([1, 5, 0, 10, 15, 20])\n",
        "ax.set_title(\"Rank of golden document (max means golden doc. wasn't retrieved)\")\n",
        "ax.set_xlabel(\"Rank\")\n",
        "ax.legend();\n"
      ],
      "metadata": {
        "id": "SLn3s3n_MlpO"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "We see that retrieval works well overall: for 80% of questions, the golden document is within the top 5 documents. However, we also notice that approx. half the false answers come from instances where the golden document wasn't retrieved (`rank = top_k = 20`). This should be improved, e.g. by adding metadata to the documents such as their section headings, or altering the chunking strategy.\n",
        "\n",
        "There is also a non-negligible instance of false answers where the top document was retrieved. On closer inspection, many of these are due to the model phrasing its answers more verbosely than the (very laconic) golden documents. This highlights the importance of checking eval results before jumping to conclusions about model performance."
      ],
      "metadata": {
        "id": "PZeTt7ijVKxo"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Conclusions\n",
        "\n",
        "In this notebook, we've built a QA bot that answers user questions based on technical documentation. We've learnt:\n",
        "\n",
        "1. How to embed the technical documentation into a vector database using Cohere embeddings and `llama_index`\n",
        "2. How to build a custom retriever that leverages Cohere's `rerank`\n",
        "3. How to evaluate model performance against a predetermined set of golden QA pairs\n",
        "\n"
      ],
      "metadata": {
        "id": "4Q5i8EYDV_da"
      }
    },
    {
      "cell_type": "code",
      "source": [],
      "metadata": {
        "id": "UC2FKrkSWcPn"
      },
      "execution_count": null,
      "outputs": []
    }
  ]
}